LLM Fairness Dashboard

Bank Complaint Handling Fairness Analysis

Generated: 2025-09-20T22:54:22.455519 | Total Experiments: 1,000

Zero-Shot Accuracy: 0.139
N-Shot Accuracy: 0.221
Sample Size: 1,000

Key Findings with Practical Importance

8 findings that are both statistically significant and practically important

These results represent real, meaningful differences that impact fairness in complaint handling.

#1 Severity and Bias → Tier Impact Large Effect

Persona injection bias differs between severity levels (χ² = 89.231)

Test: Tier Impact Rate: Zero-Shot
p-value: < 0.0001
Effect Size: cohens_d = 2.306
Sample: n = 12042
What this means:

There is strong evidence that bias is greater for more severe cases.

#2 Persona Injection → Geographic Bias Large Effect

Geographic bias differs between zero-shot and n-shot methods (F = 745.900)

Test: Geographic Bias Consistency: Zero-Shot vs N-Shot
p-value: < 0.0001
Effect Size: eta_squared = 0.882
Sample: n not reported
What this means:

Geographic bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.

#3 Persona Injection → Gender Bias Large Effect

Gender bias differs between zero-shot and n-shot methods (F = 149.481)

Test: Gender Bias Consistency: Zero-Shot vs N-Shot
p-value: < 0.0001
Effect Size: eta_squared = 0.599
Sample: n not reported
What this means:

Gender bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.

#4 Persona Injection → Geographic Bias Large Effect

Zero-tier proportions differ across geographies (χ² = 4097.203)

Test: Tier 0 Rate by Geography: Zero-Shot
p-value: < 0.0001
Effect Size: cramers_v = 0.507
Sample: n = 15961
What this means:

The proportion of zero-tier cases differs significantly between geographies, with suburban upper middle having the highest proportion.

#5 Persona Injection → Ethnicity Bias Large Effect

Ethnicity bias differs between zero-shot and n-shot methods (F = 63.728)

Test: Ethnicity Bias Consistency: Zero-Shot vs N-Shot
p-value: < 0.0001
Effect Size: eta_squared = 0.389
Sample: n not reported
What this means:

Ethnicity bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.

#6 Persona Injection → Geographic Bias Medium Effect

Zero-tier proportions differ across geographies (χ² = 1458.092)

Test: Tier 0 Rate by Geography: N-Shot
p-value: < 0.0001
Effect Size: cramers_v = 0.120
Sample: n = 100462
What this means:

The proportion of zero-tier cases differs significantly between geographies, with suburban working having the highest proportion.

#7 Persona Injection → Geographic Bias Large Effect

Mean tier differs significantly across geographies (F = 123.039)

Test: Mean Tier Comparison Across Geographies
p-value: < 0.0001
Effect Size: eta_squared = 0.072
Sample: n = 15961
What this means:

There is strong evidence that the LLM's recommended tiers differ significantly between geographies in Zero-Shot. Means: rural=1.165, rural_poor=1.245, rural_upper_middle=1.266, rural_working=1.243, suburban_poor=0.812, suburban_upper_middle=0.802, suburban_working=0.791, urban_affluent=1.285, urban_poor=1.312, urban_upper_middle=1.093, urban_working=1.143

#8 Bias Mitigation → Strategy Effectiveness Medium Effect

Bias mitigation strategies differ significantly in effectiveness (F = 27.936)

Test: Bias Mitigation Strategy Comparison: Zero-Shot
p-value: < 0.0001
Effect Size: eta_squared = 0.046
Sample: n not reported
What this means:

There is strong evidence that bias mitigation strategies differ in effectiveness.

Statistically Significant but Practically Trivial Findings

13 findings that are statistically significant but have negligible practical impact

⚠️ Interpretation Warning: These results likely reflect large sample sizes detecting tiny differences that don't meaningfully impact fairness. They should generally not drive decision-making.
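The warning can be made concrete with a little arithmetic. For a chi-squared test, Cramér's V satisfies χ² = n · V² · min(r − 1, c − 1), so for a fixed V the statistic grows linearly with n. The sketch below is illustrative only (the function name is hypothetical and not part of the dashboard's code); it computes the smallest n at which a given V reaches the α = 0.05 critical value, using the standard critical values 3.841 (df = 1) and 5.991 (df = 2).

```python
import math

# Standard chi-squared critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991}


def min_n_for_significance(cramers_v: float, df: int, min_dim: int = 1) -> int:
    """Smallest n at which an effect of size `cramers_v` reaches the 0.05 critical value.

    Rearranges V = sqrt(chi2 / (n * min_dim)), where min_dim = min(rows - 1, cols - 1).
    """
    critical = CHI2_CRITICAL_05[df]
    return math.ceil(critical / (cramers_v ** 2 * min_dim))


# A 2x3 table (df = 2, min_dim = 1) with V = 0.025, the size of the
# zero-shot gender tier-distribution effect listed in this section.
threshold_n = min_n_for_significance(0.025, df=2)
print(threshold_n)
```

Under these assumptions the threshold is below 10,000 observations, whereas the persona-injected samples in this report have n = 15,960 or more, so an effect this small is flagged as significant regardless of its practical relevance.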

#1 Severity and Bias → Tier Impact

Persona injection affects tier selection bias (χ² = 1713.322)

Test: Tier Impact Rate: Persona-Injected vs Baseline
p-value: < 0.0001
Effect Size: cohens_h = 0.148
Sample: n = 168000
⚠️ Small effect size (0.148) suggests minimal practical importance
Why this is likely trivial:

With n = 168000, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#2 Persona Injection → Gender Bias

Zero-tier proportions differ across genders (χ² = 23.742)

Test: Tier 0 Rate by Gender: N-Shot
p-value: < 0.0001
Effect Size: cohens_h = 0.078
Sample: n = 15960
⚠️ Small effect size (0.078) suggests minimal practical importance
Why this is likely trivial:

With n = 15960, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#3 Persona Injection → Gender Bias Small Effect

Female personas receive lower tier assignments than male personas (difference: 0.027)

Test: Mean Tier Difference: female vs male (N-Shot)
p-value: 0.0021
Effect Size: cohens_d = -0.049
Sample: n = 15960
⚠️ Small effect size (-0.049) suggests minimal practical importance
Why this is likely trivial:

With n = 15960, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#4 Persona Injection → Gender Bias Small Effect

Female personas receive higher tier assignments than male personas (difference: 0.025)

Test: Mean Tier Difference: female vs male (Zero-Shot)
p-value: 0.0029
Effect Size: cohens_d = 0.047
Sample: n = 15961
⚠️ Small effect size (0.047) suggests minimal practical importance
Why this is likely trivial:

With n = 15961, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#5 Persona Injection → Gender Bias Small Effect

Tier distribution differs significantly between gender groups (χ² = 24.313)

Test: Tier Distribution Comparison: N-Shot
p-value: < 0.0001
Effect Size: cramers_v = 0.039
Sample: n = 15960
⚠️ Small effect size (0.039) suggests minimal practical importance
Why this is likely trivial:

With n = 15960, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#6 Persona Injection → Ethnicity Bias Small Effect

Tier distribution differs significantly between ethnicity groups (χ² = 22.531)

Test: Tier Distribution Comparison Across Ethnicities: Zero-Shot
p-value: 0.0010
Effect Size: cramers_v = 0.027
Sample: n = 15961
⚠️ Small effect size (0.027) suggests minimal practical importance
Why this is likely trivial:

With n = 15961, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#7 Persona Injection → Gender Bias Small Effect

Tier distribution differs significantly between gender groups (χ² = 9.663)

Test: Tier Distribution Comparison: Zero-Shot
p-value: 0.0080
Effect Size: cramers_v = 0.025
Sample: n = 15961
⚠️ Small effect size (0.025) suggests minimal practical importance
Why this is likely trivial:

With n = 15961, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#8 Persona Injection → Geographic Bias Small Effect

Mean tier differs significantly across geographies (F = 82.789)

Test: Mean Tier Comparison Across Geographies: N-Shot
p-value: < 0.0001
Effect Size: eta_squared = 0.008
Sample: n = 100462
⚠️ Small effect size (0.008) suggests minimal practical importance
Why this is likely trivial:

With n = 100462, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#9 Persona Injection → Ethnicity Bias Small Effect

Mean tier differs significantly between ethnicity groups (F = 5.299)

Test: Mean Tier Comparison: ANOVA
p-value: 0.0012
Effect Size: eta_squared = 0.001
Sample: n = 15961
⚠️ Small effect size (0.001) suggests minimal practical importance
Why this is likely trivial:

With n = 15961, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#10 Persona Injection → Geographic Bias Small Effect

Tier distribution differs significantly across geographies (χ² = 4655.672)

Test: Tier Distribution Comparison Across Geographies
p-value: < 0.0001
Effect Size: cramers_v = 0.000
Sample: n = 15961
⚠️ Small effect size (0.000) suggests minimal practical importance
Why this is likely trivial:

With n = 15961, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#11 Persona Injection → Geographic Bias Small Effect

Tier distribution differs significantly across geographies (χ² = 2118.224)

Test: Tier Distribution Comparison Across Geographies
p-value: < 0.0001
Effect Size: cramers_v = 0.000
Sample: n = 100462
⚠️ Small effect size (0.000) suggests minimal practical importance
Why this is likely trivial:

With n = 100462, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#12 Persona Injection → Geographic Bias Small Effect

Question rate differs significantly across geographies

Test: Question Rate Comparison Across Geographies
p-value: < 0.0001
Effect Size: cramers_v = 0.000
Sample: n = 15961
⚠️ Small effect size (0.000) suggests minimal practical importance
Why this is likely trivial:

With n = 15961, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

#13 Persona Injection → Geographic Bias Small Effect

Question rate differs significantly across geographies

Test: Question Rate Comparison Across Geographies
p-value: 0.0275
Effect Size: cramers_v = 0.000
Sample: n = 100462
⚠️ Small effect size (0.000) suggests minimal practical importance
Why this is likely trivial:

With n = 100462, even tiny differences become statistically significant. The effect size indicates this difference is too small to matter in practice.

Tier Recommendations

Result 1: Confusion Matrix – Zero Shot
Rows: baseline tier. Columns: persona tier.
Baseline Tier 0 Tier 1 Tier 2
0 0 157 11
1 1 8,343 855
2 7 469 2,199
Result 2: Confusion Matrix – N-Shot
Rows: baseline tier. Columns: persona tier.
Baseline Tier 0 Tier 1 Tier 2
0 603 1,123 2
1 403 7,537 366
2 7 346 1,656
Result 3: Tier Impact Rate
LLM Method Same Tier Different Tier Total % Different
n shot 9,796 2,247 12,043 18.7%
zero shot 10,542 1,500 12,042 12.5%
Total 20,338 3,747 24,085 15.6%

Statistical Analysis

Hypothesis: H0: The tier impact rate (the share of cases whose tier changes under persona injection) is the same for zero-shot and n-shot prompting

Test: Chi-squared test of independence

Effect Size (Cramér's V): 0.085 (negligible)

Test Statistic: χ²(1) = 175.813

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: Persona injection changes the tier more often under n-shot than under zero-shot prompting (18.7% vs. 12.5%), but the association between method and tier change is weak (Cramér's V = 0.085).
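The test statistic and effect size for Result 3 can be reproduced from the four counts in its table. The sketch below uses only the standard library; it applies Yates' continuity correction, which is an assumption on our part (the dashboard does not say so, but the corrected statistic matches the reported value). The function name is illustrative.

```python
import math


def chi2_2x2_yates(table):
    """Chi-squared statistic for a 2x2 table with Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Yates: shrink |observed - expected| by 0.5 (never below 0) before squaring.
            deviation = max(abs(observed - expected) - 0.5, 0.0)
            chi2 += deviation ** 2 / expected
    return chi2, n


# Rows: n-shot, zero-shot. Columns: same tier, different tier (Result 3).
counts = [[9796, 2247], [10542, 1500]]
chi2, n = chi2_2x2_yates(counts)
cramers_v = math.sqrt(chi2 / n)  # min(rows - 1, cols - 1) = 1 for a 2x2 table
print(f"chi2(1) = {chi2:.3f}, Cramér's V = {cramers_v:.3f}")
```

Note that the table cross-tabulates prompting method against whether the tier changed, so the statistic measures how the tier impact rate differs between the two methods.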

Result 4: Mean Tier – Persona-Injected vs. Baseline
LLM Method Mean Baseline Tier Mean Persona Tier N Std Dev SEM
n shot 1.02 1.08 12,043 0.43 0.0039
zero shot 1.21 1.25 12,042 0.36 0.0032

Statistical Analysis (N Shot):

H0: The mean tier is the same with and without persona injection

Test: Paired t-test

Effect Size (Cohen's d): 0.141 (negligible)

Mean Difference: +0.06 (from 1.02 to 1.08)

Test Statistic: t(12042) = 15.4588

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05).

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM's recommended tier is higher when it sees humanizing attributes, somewhat analogous to a display of empathy.

Statistical Analysis (Zero Shot):

H0: The mean tier is the same with and without persona injection

Test: Paired t-test

Effect Size (Cohen's d): 0.128 (negligible)

Mean Difference: +0.05 (from 1.21 to 1.25)

Test Statistic: t(12041) = 14.0656

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05).

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM's recommended tier is higher when it sees humanizing attributes, somewhat analogous to a display of empathy.
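Both paired t-tests in Result 4 can be checked against the confusion matrices in Results 1 and 2, because each cell (baseline tier i, persona tier j) contributes a paired difference of j − i. The following is a minimal sketch, assuming the matrix counts as read from those two results (rows are baseline tiers, columns are persona tiers) and assuming Cohen's d is computed as the mean difference divided by the standard deviation of the differences. The function name is hypothetical.

```python
import math


def paired_t_from_confusion(matrix):
    """Paired t statistic, Cohen's d, and df from a baseline-by-persona tier matrix.

    matrix[i][j] counts cases with baseline tier i and persona tier j,
    so each such case has a paired difference of (j - i).
    """
    cells = [(j - i, count) for i, row in enumerate(matrix) for j, count in enumerate(row)]
    n = sum(count for _, count in cells)
    mean_diff = sum(diff * count for diff, count in cells) / n
    sum_sq = sum(count * (diff - mean_diff) ** 2 for diff, count in cells)
    sd_diff = math.sqrt(sum_sq / (n - 1))  # sample standard deviation of differences
    sem = sd_diff / math.sqrt(n)           # standard error of the mean difference
    return mean_diff / sem, mean_diff / sd_diff, n - 1


n_shot = [[603, 1123, 2], [403, 7537, 366], [7, 346, 1656]]
zero_shot = [[0, 157, 11], [1, 8343, 855], [7, 469, 2199]]

t_n, d_n, df_n = paired_t_from_confusion(n_shot)
t_z, d_z, df_z = paired_t_from_confusion(zero_shot)
print(f"N-shot:    t({df_n}) = {t_n:.4f}, d = {d_n:.3f}")
print(f"Zero-shot: t({df_z}) = {t_z:.4f}, d = {d_z:.3f}")
```

Because d here equals t divided by the square root of n, a t statistic of about 15 on roughly 12,000 pairs still corresponds to an effect size below 0.2.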

Result 5: Tier Distribution – Persona-Injected vs. Baseline
Method Tier 0 Tier 1 Tier 2
Baseline 79 728 193
Persona Injected 2,884 21,675 7,362

Statistical Analysis:

Hypothesis: H0: The tier distribution is independent of persona injection.

Test: Chi-squared test of independence

Effect Size (Cramér's V): 0.018 (negligible)

Test Statistic: χ²(2) = 10.79

p-value: 0.0045

Conclusion: The null hypothesis was rejected (p < 0.05).

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: While statistically significant, the effect of persona injection on tier distribution is practically trivial and likely due to large sample size.
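For tables larger than 2×2, the same test is computed from expected counts under independence, with no continuity correction. A minimal standard-library sketch follows, using the Result 5 counts (79, 728, 193 for baseline and 2,884, 21,675, 7,362 for persona-injected); the function name is illustrative and not taken from the dashboard's code.

```python
import math


def chi2_independence(table):
    """Pearson chi-squared statistic, degrees of freedom, and Cramér's V."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    rows, cols = len(table), len(table[0])
    df = (rows - 1) * (cols - 1)
    v = math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    return chi2, df, v


# Rows: baseline, persona-injected. Columns: tier 0, tier 1, tier 2.
tier_table = [[79, 728, 193], [2884, 21675, 7362]]
chi2_r5, df_r5, v_r5 = chi2_independence(tier_table)
print(f"chi2({df_r5}) = {chi2_r5:.2f}, Cramér's V = {v_r5:.3f}")
```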

Process Bias

Result 1: Question Rate – Persona-Injected vs. Baseline – Zero-Shot
Condition Count Questions Question Rate %
Baseline 500 29 5.8%
Persona-Injected 15,961 542 3.4%

Statistical Analysis:

H0: The question rate is the same with and without persona injection

Test: Chi-squared test of independence

Effect Size: 0.022 (negligible)

Test Statistic: χ²(1) = 7.67

p-value: 0.0056

Conclusion: The null hypothesis was rejected (p < 0.05).

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM asks questions less often when it sees humanizing attributes (3.4% vs. 5.8%), which may indicate reduced engagement or scrutiny, although the effect size is negligible.

Result 2: Question Rate – Persona-Injected vs. Baseline – N-Shot
Condition Count Questions Question Rate %
Baseline 500 0 0.0%
Persona-Injected 15,960 24 0.2%

Statistical Analysis:

H0: The question rate is the same with and without persona injection

Test: Chi-squared test of independence

Effect Size: 0.002 (negligible)

Test Statistic: χ²(1) = 0.07

p-value: 0.7851

Conclusion: The null hypothesis was not rejected (p ≥ 0.05).

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: The LLM's question rate is not significantly affected by humanizing attributes.

Result 3: N-Shot versus Zero-Shot
Method Count Questions Question Rate %
Zero-Shot 15,961 542 3.4%
N-Shot 15,960 24 0.2%

Statistical Analysis:

H0: The question rate is the same with and without N-Shot examples

Test: Chi-squared test of independence

Effect Size: 0.123 (small to medium)

Test Statistic: χ²(1) = 480.73

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05).

Practical Significance: This result is statistically significant and practically modest.

Implication: With persona injection, N-Shot examples sharply reduce how often the LLM asks questions (0.2% vs. 3.4% under Zero-Shot).

Gender Bias

Result 1: Mean Tier by Gender and by Zero-Shot/N-Shot

Zero-Shot Mean Tier by Gender

Gender Mean Tier Count Std Dev
Female 1.216 7,984 0.540
Male 1.191 7,977 0.533

Statistical Analysis - Zero-Shot

Hypothesis: H0: The mean tier is the same for female and male personas

Test: Independent two-sample t-test

Effect Size: 0.047 (negligible)

Mean Difference: 0.025

Test Statistic: t(15959) = 2.975

p-value: 0.0029

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM's mean recommended tier differs slightly by gender, with male personas receiving the lower mean (1.191 vs. 1.216).
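The reported degrees of freedom (15,959 = 7,984 + 7,977 − 2) correspond to a two-sample t-test with pooled variance on independent groups. The sketch below recomputes the statistic from the zero-shot tier counts by gender in Result 2 (female 486 / 5,286 / 2,212; male 516 / 5,422 / 2,039). Treating the test as a pooled-variance Student's t-test is an assumption inferred from the degrees of freedom, and the function names are hypothetical.

```python
import math


def mean_and_var(tier_counts):
    """Sample size, mean, and sample variance of tiers 0, 1, 2 given their counts."""
    n = sum(tier_counts)
    mean = sum(tier * count for tier, count in enumerate(tier_counts)) / n
    var = sum(count * (tier - mean) ** 2 for tier, count in enumerate(tier_counts)) / (n - 1)
    return n, mean, var


def pooled_t_test(counts_a, counts_b):
    """Student's two-sample t statistic, df, and Cohen's d using the pooled SD."""
    n_a, mean_a, var_a = mean_and_var(counts_a)
    n_b, mean_b, var_b = mean_and_var(counts_b)
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / df)
    t = (mean_a - mean_b) / (pooled_sd * math.sqrt(1 / n_a + 1 / n_b))
    return t, df, (mean_a - mean_b) / pooled_sd


female = [486, 5286, 2212]  # zero-shot tier 0, 1, 2 counts
male = [516, 5422, 2039]
t_gender, df_gender, d_gender = pooled_t_test(female, male)
print(f"t({df_gender}) = {t_gender:.3f}, Cohen's d = {d_gender:.3f}")
```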

N-Shot Mean Tier by Gender

Gender Mean Tier Count Std Dev
Female 1.064 7,982 0.566
Male 1.090 7,978 0.541

Statistical Analysis - N-Shot

Hypothesis: H0: The mean tier is the same for female and male personas

Test: Independent two-sample t-test

Effect Size: -0.049 (negligible)

Mean Difference: -0.027

Test Statistic: t(15958) = -3.077

p-value: 0.0021

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM's mean recommended tier differs slightly by gender, with female personas receiving the lower mean (1.064 vs. 1.090).

Result 2: Tier Distribution by Gender and by Zero-Shot/N-Shot

Zero-Shot Tier Distribution by Gender

Gender Tier 0 Tier 1 Tier 2
Female 486 5,286 2,212
Male 516 5,422 2,039

Statistical Analysis - Zero-Shot

Hypothesis: H0: The tier distribution is the same for female and male personas

Test: Chi-squared test

Effect Size: 0.025 (negligible)

Test Statistic: χ²(2) = 9.663

p-value: 0.0080

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM's recommended tiers are biased by gender.

N-Shot Tier Distribution by Gender

Gender Tier 0 Tier 1 Tier 2
Female 1,041 5,393 1,548
Male 841 5,574 1,563

Statistical Analysis - N-Shot

Hypothesis: H0: The tier distribution is the same for female and male personas

Test: Chi-squared test

Effect Size: 0.039 (negligible)

Test Statistic: χ²(2) = 24.313

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The LLM's recommended tiers are biased by gender.

Result 3: Tier Bias Distribution by Gender and by Zero-Shot/N-Shot
Gender Count Mean Zero-Shot Tier Mean N-Shot Tier
Female 15,966 1.216 1.064
Male 15,955 1.191 1.090

Statistical Analysis

Hypothesis: H0: Gender bias is consistent between zero-shot and n-shot methods (no interaction effect)

Test: cumulative-logit (proportional-odds) mixed model with random intercept for case_id

Effect Size (Partial η²): 0.599 (large)

Test Statistic: F = 149.481

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically substantial.

Implication: Gender bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.

Result 4: Question Rate – Persona-Injected vs. Baseline – by Gender and by Zero-Shot/N-Shot

Zero-Shot Question Rate by Gender

Gender Count Questions Question Rate %
Female 7,984 290 3.6%
Male 7,977 252 3.2%

Statistical Analysis - Zero-Shot

Hypothesis: H0: The question rate is the same across genders

Test: Chi-squared test of independence

Effect Size: 0.013 (negligible)

Rate Difference: 0.5%

Test Statistic: χ²(1) = 2.581

p-value: 0.1081

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the LLM's questioning behavior is biased by gender.

N-Shot Question Rate by Gender

Gender Count Questions Question Rate %
Female 7,982 13 0.2%
Male 7,978 11 0.1%

Statistical Analysis - N-Shot

Hypothesis: H0: The question rate is the same across genders

Test: Chi-squared test of independence

Effect Size: 0.002 (negligible)

Rate Difference: 0.0%

Test Statistic: χ²(1) = 0.041

p-value: 0.8391

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the LLM's questioning behavior is biased by gender.

Result 5: Disadvantage Ranking by Gender and by Zero-Shot/N-Shot
Ranking Zero-Shot N-Shot
Most Advantaged Female Male
Most Disadvantaged Male Female

Note: Rankings are based on mean tier assignments. Higher mean tiers indicate more advantaged outcomes.

Result 6: Tier 0 Rate by Gender - Zero Shot
Gender Sample Size Zero Tier Proportion Zero
Female 7,984 486 0.061
Male 7,977 516 0.065

Statistical Analysis

Hypothesis: H0: The proportion of zero-tier cases is the same for all genders

Test: Chi-squared test on counts

Effect Sizes:

  • Proportion Difference (Cohen's h): -0.016 (negligible)
  • Risk Ratio: 0.94 (female vs male)
  • Association (Cramér's V): 0.008

Test Statistic: χ² = 0.923

p-Value: 0.337

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the proportion of zero-tier cases varies with gender.

Result 7: Tier 0 Rate by Gender - N-Shot
Gender Sample Size Zero Tier Proportion Zero
Female 7,982 1,041 0.130
Male 7,978 841 0.105

Statistical Analysis

Hypothesis: H0: The proportion of zero-tier cases is the same for all genders

Test: Chi-squared test on counts

Effect Sizes:

  • Proportion Difference (Cohen's h): 0.078 (negligible)
  • Risk Ratio: 1.24 (female vs male)
  • Association (Cramér's V): 0.039

Test Statistic: χ² = 23.742

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: While statistically significant, the difference in zero-tier proportions between genders is practically trivial and likely due to large sample size.
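Cohen's h and the risk ratio listed under Effect Sizes follow directly from the two proportions in the Result 7 table. A small sketch, using the standard definition of Cohen's h (the difference of arcsine-transformed proportions); the variable names are illustrative.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# N-shot tier-0 counts and sample sizes by gender (Result 7).
female_rate = 1041 / 7982
male_rate = 841 / 7978

h = cohens_h(female_rate, male_rate)
risk_ratio = female_rate / male_rate  # female vs male
print(f"Cohen's h = {h:.3f}, risk ratio = {risk_ratio:.2f}")
```

The two measures answer different questions: the risk ratio says female personas land in tier 0 about 24% more often in relative terms, while h expresses the same gap on a scale where values below 0.2 are conventionally considered small.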

Ethnicity Bias

Result 1: Mean Tier by Ethnicity and by Zero-Shot/N-Shot

Zero-Shot Mean Tier by Ethnicity

Ethnicity Mean Tier Count Std Dev
Asian 1.230 3,965 0.540
Black 1.205 3,989 0.547
Latino 1.183 4,019 0.530
White 1.197 3,988 0.527

Statistical Analysis

Hypothesis: H0: The mean tier is the same across all ethnicities

Test: One-way ANOVA

Comparison: All ethnicities: asian, black, latino, white

Effect Size: 0.001 (negligible)

Test Statistic: F = 5.299

p-Value: 0.0012

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: There is strong evidence that the LLM's recommended tiers differ significantly between ethnicities in Zero-Shot. Means: asian=1.230, black=1.205, latino=1.183, white=1.197
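The one-way ANOVA above can be recomputed from the zero-shot tier counts per ethnicity reported in Result 2, since the tier values are just 0, 1, and 2. The sketch below is a plain-Python illustration (the function name is hypothetical); it also shows why η² is so small: the between-group sum of squares is a tiny fraction of the total.

```python
def one_way_anova(groups):
    """F statistic, (df_between, df_within), and eta squared from per-group tier counts.

    `groups` maps a label to [count of tier 0, count of tier 1, count of tier 2].
    """
    n_total = sum(sum(counts) for counts in groups.values())
    grand_sum = sum(t * c for counts in groups.values() for t, c in enumerate(counts))
    grand_mean = grand_sum / n_total
    ss_between = 0.0
    ss_within = 0.0
    for counts in groups.values():
        n = sum(counts)
        mean = sum(t * c for t, c in enumerate(counts)) / n
        ss_between += n * (mean - grand_mean) ** 2
        ss_within += sum(c * (t - mean) ** 2 for t, c in enumerate(counts))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, (df_between, df_within), eta_sq


zero_shot_by_ethnicity = {
    "asian": [227, 2600, 1138],
    "black": [273, 2627, 1089],
    "latino": [263, 2757, 999],
    "white": [239, 2724, 1025],
}
f_stat, dfs, eta_sq = one_way_anova(zero_shot_by_ethnicity)
print(f"F{dfs} = {f_stat:.3f}, eta squared = {eta_sq:.4f}")
```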

N-Shot Mean Tier by Ethnicity

Ethnicity Mean Tier Count Std Dev
Asian 1.076 4,024 0.551
Black 1.080 3,979 0.548
Latino 1.080 3,978 0.553
White 1.071 3,979 0.564

Statistical Analysis

Hypothesis: H0: The mean tier is the same across all ethnicities

Test: One-way ANOVA

Comparison: All ethnicities: asian, black, latino, white

Effect Size: 0.000 (negligible)

Test Statistic: F = 0.227

p-Value: 0.8777

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the LLM's recommended tiers differ between ethnicities in N-Shot. Means: asian=1.076, black=1.080, latino=1.080, white=1.071

Result 2: Tier Distribution by Ethnicity and by Zero-Shot/N-Shot

Zero-Shot Tier Distribution by Ethnicity

Ethnicity Tier 0 Tier 1 Tier 2
Asian 227 2,600 1,138
Black 273 2,627 1,089
Latino 263 2,757 999
White 239 2,724 1,025

Statistical Analysis

Hypothesis: H0: The tier distribution is the same across ethnicities

Test: Chi-squared test of independence

Effect Size: 0.027 (negligible)

Test Statistic: χ² = 22.531

Degrees of Freedom: 6

p-Value: 0.0010

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: There is strong evidence that the tier distribution differs significantly between ethnicities in Zero-Shot.

N-Shot Tier Distribution by Ethnicity

Ethnicity Tier 0 Tier 1 Tier 2
Asian 470 2,777 777
Black 450 2,761 768
Latino 462 2,734 782
White 500 2,695 784

Statistical Analysis

Hypothesis: H0: The tier distribution is the same across ethnicities

Test: Chi-squared test of independence

Effect Size: 0.011 (negligible)

Test Statistic: χ² = 4.124

Degrees of Freedom: 6

p-Value: 0.6599

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the tier distribution differs between ethnicities in N-Shot.

Result 3: Tier Bias Distribution by Ethnicity and by Zero-Shot/N-Shot
Ethnicity Count Mean Zero-Shot Tier Mean N-Shot Tier
Asian 7,989 1.230 1.076
Black 7,968 1.205 1.080
Latino 7,997 1.183 1.080
White 7,967 1.197 1.071

Note: Mean tiers are calculated from persona-injected experiments only (excluding bias mitigation).

Statistical Analysis

Hypothesis: H0: Ethnicity bias is consistent between zero-shot and n-shot methods (no interaction effect)

Test: cumulative-logit (proportional-odds) mixed model with random intercept for case_id

Effect Size (Partial η²): 0.389 (large)

Test Statistic: F = 63.728

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically substantial.

Implication: Ethnicity bias is inconsistent between zero-shot and n-shot methods - the bias differs significantly across prompt types.

Result 4: Question Rate – Persona-Injected vs. Baseline – by Ethnicity and by Zero-Shot/N-Shot

Zero-Shot Question Rate by Ethnicity

Ethnicity Questions Total Question Rate
Asian 142 3,965 3.6%
Black 149 3,989 3.7%
Latino 128 4,019 3.2%
White 123 3,988 3.1%

Statistical Analysis

Hypothesis: H0: The question rate is the same across ethnicities

Test: Chi-squared test of independence

Effect Size: 0.015 (negligible)

Test Statistic: χ² = 3.542

Degrees of Freedom: 3

p-Value: 0.3153

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the question rate differs between ethnicities in Zero-Shot.

N-Shot Question Rate by Ethnicity

Ethnicity Questions Total Question Rate
Asian 8 4,024 0.2%
Black 6 3,979 0.2%
Latino 5 3,978 0.1%
White 5 3,979 0.1%

Statistical Analysis

Hypothesis: H0: The question rate is the same across ethnicities

Test: Chi-squared test of independence

Effect Size: 0.008 (negligible)

Test Statistic: χ² = 0.952

Degrees of Freedom: 3

p-Value: 0.8129

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the question rate differs between ethnicities in N-Shot.

Result 5: Disadvantage Ranking by Ethnicity and by Zero-Shot/N-Shot
Ranking Zero-Shot N-Shot
Most Advantaged Asian Latino
Most Disadvantaged Latino White

Note: Rankings are based on mean tier assignments. Higher mean tiers indicate more advantaged outcomes.

Result 6: Tier 0 Rate by Ethnicity - Zero Shot
Ethnicity Sample Size Zero Tier Proportion Zero
Asian 3,965 227 0.057
Black 3,989 273 0.068
Latino 4,019 263 0.065
White 3,988 239 0.060

Statistical Analysis

Hypothesis: H0: The proportion of zero-tier cases is the same for all ethnicities

Test: Chi-squared test on counts

Effect Size: 0.018 (negligible)

Test Statistic: χ² = 5.264

p-Value: 0.153

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the proportion of zero-tier cases varies with ethnicity.

Result 7: Tier 0 Rate by Ethnicity - N-Shot
Ethnicity Sample Size Zero Tier Proportion Zero
Asian 4,024 470 0.117
Black 3,979 450 0.113
Latino 3,978 462 0.116
White 3,979 500 0.126

Statistical Analysis

Hypothesis: H0: The proportion of zero-tier cases is the same for all ethnicities

Test: Chi-squared test on counts

Effect Size: 0.014 (negligible)

Test Statistic: χ² = 3.353

p-Value: 0.340

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that the proportion of zero-tier cases varies with ethnicity.

Geographic Bias

Result 1: Mean Tier by Geography and by Zero-Shot/N-Shot

Zero-Shot Mean Tier by Geography

Geography Mean Tier Count Std Dev
Rural 1.165 4,000 0.371
Rural Poor 1.245 506 0.689
Rural Upper Middle 1.266 489 0.689
Rural Working 1.243 507 0.716
Suburban Poor 0.812 501 0.743
Suburban Upper Middle 0.802 474 0.766
Suburban Working 0.791 497 0.733
Urban Affluent 1.285 4,000 0.452
Urban Poor 1.312 4,000 0.463
Urban Upper Middle 1.093 496 0.696
Urban Working 1.143 491 0.711

Statistical Analysis

Hypothesis: H0: The mean tier is the same across all geographies

Test: One-way ANOVA

Comparison: All geographies: rural, rural_poor, rural_upper_middle, rural_working, suburban_poor, suburban_upper_middle, suburban_working, urban_affluent, urban_poor, urban_upper_middle, urban_working

Effect Size: 0.072 (medium)

Test Statistic: F = 123.039

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically modest.

Implication: There is strong evidence that the LLM's recommended tiers differ significantly between geographies in Zero-Shot. Means: rural=1.165, rural_poor=1.245, rural_upper_middle=1.266, rural_working=1.243, suburban_poor=0.812, suburban_upper_middle=0.802, suburban_working=0.791, urban_affluent=1.285, urban_poor=1.312, urban_upper_middle=1.093, urban_working=1.143

N-Shot Mean Tier by Geography

Geography Mean Tier Count Std Dev
Rural 1.037 32,038 0.520
Rural Poor 1.229 529 0.695
Rural Upper Middle 1.250 553 0.691
Rural Working 1.211 536 0.682
Suburban Poor 0.823 542 0.697
Suburban Upper Middle 0.928 553 0.641
Suburban Working 0.807 522 0.684
Urban Affluent 1.062 32,049 0.520
Urban Poor 1.113 32,055 0.481
Urban Upper Middle 1.118 536 0.589
Urban Working 1.064 549 0.649

Statistical Analysis

Hypothesis: H0: The mean tier is the same across all geographies

Test: One-way ANOVA

Comparison: All geographies: rural, rural_poor, rural_upper_middle, rural_working, suburban_poor, suburban_upper_middle, suburban_working, urban_affluent, urban_poor, urban_upper_middle, urban_working

Effect Size: 0.008 (negligible)

Test Statistic: F = 82.789

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: There is strong evidence that the LLM's recommended tiers differ significantly between geographies in N-Shot. Means: rural=1.037, rural_poor=1.229, rural_upper_middle=1.250, rural_working=1.211, suburban_poor=0.823, suburban_upper_middle=0.928, suburban_working=0.807, urban_affluent=1.062, urban_poor=1.113, urban_upper_middle=1.118, urban_working=1.064

Result 2: Tier Distribution by Geography and by Zero-Shot/N-Shot

Zero-Shot Tier Distribution by Geography

Geography Tier 0 Tier 1 Tier 2
Rural 0 3,340 660
Rural Poor 73 236 197
Rural Upper Middle 68 223 198
Rural Working 83 218 206
Suburban Poor 194 207 100
Suburban Upper Middle 195 178 101
Suburban Working 196 209 92
Urban Affluent 0 2,859 1,141
Urban Poor 0 2,753 1,247
Urban Upper Middle 99 252 145
Urban Working 94 233 164

Statistical Analysis

Hypothesis: H0: The tier distribution is the same across geographies

Test: Chi-squared test of independence

Effect Size: 0.382 (large)

Test Statistic: χ² = 4655.672

Degrees of Freedom: 20

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically substantial.

Implication: There is strong evidence that the tier distribution differs significantly between geographies in Zero-Shot.
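As a rough check on the zero-shot result, the chi-squared statistic and its effect size can be recomputed from the tier counts. This sketch assumes a plain Pearson chi-squared test and that the reported effect size is Cramér's V; the per-cell counts are read from the table, and `chi_squared_independence` is a hypothetical helper name.

```python
from math import sqrt

# Zero-shot counts per geography: (Tier 0, Tier 1, Tier 2).
zero_shot_counts = {
    "Rural": (0, 3340, 660),
    "Rural Poor": (73, 236, 197),
    "Rural Upper Middle": (68, 223, 198),
    "Rural Working": (83, 218, 206),
    "Suburban Poor": (194, 207, 100),
    "Suburban Upper Middle": (195, 178, 101),
    "Suburban Working": (196, 209, 92),
    "Urban Affluent": (0, 2859, 1141),
    "Urban Poor": (0, 2753, 1247),
    "Urban Upper Middle": (99, 252, 145),
    "Urban Working": (94, 233, 164),
}


def chi_squared_independence(rows):
    """Pearson chi-squared statistic, degrees of freedom, and Cramér's V."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    cramers_v = sqrt(stat / (n * min(len(rows) - 1, len(col_totals) - 1)))
    return stat, dof, cramers_v


stat, dof, cramers_v = chi_squared_independence(list(zero_shot_counts.values()))
print(f"chi2 = {stat:.3f}, dof = {dof}, Cramér's V = {cramers_v:.3f}")
```

Much of the statistic comes from the Tier 0 column, where three geographies have zero observed cases against an expected count of roughly 250 each.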

N-Shot Tier Distribution by Geography

Geography Tier 0 Tier 1 Tier 2
Rural 3,770 23,324 4,944
Rural Poor 81 246 202
Rural Upper Middle 80 255 218
Rural Working 80 263 193
Suburban Poor 188 262 92
Suburban Upper Middle 135 323 95
Suburban Working 182 259 81
Urban Affluent 3,403 23,247 5,399
Urban Poor 2,110 24,217 5,728
Urban Upper Middle 65 343 128
Urban Working 99 316 134

Statistical Analysis

Hypothesis: H0: The tier distribution is the same across geographies

Test: Chi-squared test of independence

Effect Size: 0.103 (small to medium)

Test Statistic: χ² = 2118.224

Degrees of Freedom: 20

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically modest.

Implication: There is strong evidence that the tier distribution differs significantly between geographies in N-Shot.

Result 3: Tier Bias Distribution by Geography and by Zero-Shot/N-Shot
Geography Count Mean Zero-Shot Tier Mean N-Shot Tier
Rural 64,092 1.242 1.037
Rural Poor 1,070 1.240 1.229
Rural Upper Middle 1,086 1.265 1.250
Rural Working 1,082 1.247 1.211
Suburban Poor 1,100 0.823 0.823
Suburban Upper Middle 1,081 0.811 0.928
Suburban Working 1,056 0.811 0.807
Urban Affluent 64,099 1.318 1.062
Urban Poor 64,088 1.395 1.113
Urban Upper Middle 1,082 1.101 1.118
Urban Working 1,085 1.129 1.064

Note: Mean tiers are calculated from persona-injected experiments only (excluding bias mitigation).

Statistical Analysis

Hypothesis: H0: Geographic bias is consistent between zero-shot and n-shot methods (no interaction effect)

Test: cumulative-logit (proportional-odds) mixed model with random intercept for case_id

Effect Size (Partial η²): 0.882 (large)

Test Statistic: F = 745.900

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically substantial.

Implication: Geographic bias is inconsistent between zero-shot and n-shot methods: the bias differs significantly across prompt types.

Result 4: Question Rate – Persona-Injected vs. Baseline – by Geography and by Zero-Shot/N-Shot

Zero-Shot Question Rate by Geography

Geography Questions Total Question Rate
Rural 212 4,000 5.3%
Rural Poor 0 506 0.0%
Rural Upper Middle 0 489 0.0%
Rural Working 0 507 0.0%
Suburban Poor 0 501 0.0%
Suburban Upper Middle 0 474 0.0%
Suburban Working 0 497 0.0%
Urban Affluent 114 4,000 2.9%
Urban Poor 216 4,000 5.4%
Urban Upper Middle 0 496 0.0%
Urban Working 0 491 0.0%

Statistical Analysis

Hypothesis: H0: The question rate is the same across geographies

Test: Chi-squared test of independence

Effect Size: 0.122 (small to medium)

Test Statistic: χ² = 236.061

Degrees of Freedom: 10

p-Value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically modest.

Implication: There is strong evidence that the question rate differs significantly between geographies in Zero-Shot.

N-Shot Question Rate by Geography

Geography Questions Total Question Rate
Rural 65 32,038 0.2%
Rural Poor 0 529 0.0%
Rural Upper Middle 0 553 0.0%
Rural Working 0 536 0.0%
Suburban Poor 1 542 0.2%
Suburban Upper Middle 0 553 0.0%
Suburban Working 1 522 0.2%
Urban Affluent 28 32,049 0.1%
Urban Poor 53 32,055 0.2%
Urban Upper Middle 0 536 0.0%
Urban Working 0 549 0.0%

Statistical Analysis

Hypothesis: H0: The question rate is the same across geographies

Test: Chi-squared test of independence

Effect Size: 0.014 (negligible)

Test Statistic: χ² = 20.187

Degrees of Freedom: 10

p-Value: 0.0275

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: The question rate differs between geographies in N-Shot at the 0.05 level (p = 0.0275), but the effect size is negligible, so the evidence for a meaningful difference is weak.
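With so few questions in the N-Shot data, it may be worth checking the expected cell counts behind this chi-squared test. A common rule of thumb treats the approximation as unreliable when expected counts fall below about 5. The sketch below applies that rule to the table; the threshold and the helper name `expected_counts` are assumptions, not part of the dashboard.

```python
# N-shot (questions, total) per geography, copied from the table.
n_shot_questions = {
    "Rural": (65, 32038),
    "Rural Poor": (0, 529),
    "Rural Upper Middle": (0, 553),
    "Rural Working": (0, 536),
    "Suburban Poor": (1, 542),
    "Suburban Upper Middle": (0, 553),
    "Suburban Working": (1, 522),
    "Urban Affluent": (28, 32049),
    "Urban Poor": (53, 32055),
    "Urban Upper Middle": (0, 536),
    "Urban Working": (0, 549),
}


def expected_counts(groups):
    """Expected (question, no-question) counts per group under independence."""
    total_questions = sum(q for q, _ in groups.values())
    total_n = sum(n for _, n in groups.values())
    overall_rate = total_questions / total_n
    return {
        name: (n * overall_rate, n * (1 - overall_rate))
        for name, (_, n) in groups.items()
    }


expected = expected_counts(n_shot_questions)
# Geographies whose smallest expected cell is below the rule-of-thumb threshold.
small_cells = [name for name, cells in expected.items() if min(cells) < 5]
print(f"{len(small_cells)} of {len(expected)} geographies have an expected count below 5")
```

If many groups fall below the threshold, an exact test or pooling the small geographies would be reasonable alternatives to consider.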

Result 5: Disadvantage Ranking by Geography and by Zero-Shot/N-Shot
Ranking Zero-Shot N-Shot
Most Advantaged Urban Poor Rural Upper Middle
Most Disadvantaged Suburban Working Suburban Working

Note: Rankings are based on mean tier assignments. Higher mean tiers indicate more advantaged outcomes.
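The ranking rule in the note above can be sketched as a simple maximum and minimum over the per-geography mean tiers. The means below are the ones listed in the Zero-Shot and N-Shot implications for Result 1; the function name `rank_extremes` is hypothetical.

```python
zero_shot_means = {
    "rural": 1.165, "rural_poor": 1.245, "rural_upper_middle": 1.266,
    "rural_working": 1.243, "suburban_poor": 0.812,
    "suburban_upper_middle": 0.802, "suburban_working": 0.791,
    "urban_affluent": 1.285, "urban_poor": 1.312,
    "urban_upper_middle": 1.093, "urban_working": 1.143,
}
n_shot_means = {
    "rural": 1.037, "rural_poor": 1.229, "rural_upper_middle": 1.250,
    "rural_working": 1.211, "suburban_poor": 0.823,
    "suburban_upper_middle": 0.928, "suburban_working": 0.807,
    "urban_affluent": 1.062, "urban_poor": 1.113,
    "urban_upper_middle": 1.118, "urban_working": 1.064,
}


def rank_extremes(means):
    """Return (most advantaged, most disadvantaged) by mean tier."""
    most_advantaged = max(means, key=means.get)
    most_disadvantaged = min(means, key=means.get)
    return most_advantaged, most_disadvantaged


print("zero-shot:", rank_extremes(zero_shot_means))
print("n-shot:", rank_extremes(n_shot_means))
```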

Result 6: Tier 0 Rate by Geography - Zero Shot
Geography Sample Size Zero-Tier Count Zero-Tier Proportion
Rural 4,000 0 0.000
Rural Poor 506 73 0.144
Rural Upper Middle 489 68 0.139
Rural Working 507 83 0.164
Suburban Poor 501 194 0.387
Suburban Upper Middle 474 195 0.411
Suburban Working 497 196 0.394
Urban Affluent 4,000 0 0.000
Urban Poor 4,000 0 0.000
Urban Upper Middle 496 99 0.200
Urban Working 491 94 0.191

Statistical Analysis

Hypothesis: H0: The proportion of zero-tier cases is the same for all geographies

Test: Chi-squared test on counts

Effect Size: 0.507 (large)

Test Statistic: χ² = 4097.203

p-Value: < 0.001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically substantial.

Implication: The proportion of zero-tier cases differs significantly between geographies, with suburban upper middle having the highest proportion.

Result 7: Tier 0 Rate by Geography - N-Shot
Geography Sample Size Zero-Tier Count Zero-Tier Proportion
Rural 32,038 3,770 0.118
Rural Poor 529 81 0.153
Rural Upper Middle 553 80 0.145
Rural Working 536 80 0.149
Suburban Poor 542 188 0.347
Suburban Upper Middle 553 135 0.244
Suburban Working 522 182 0.349
Urban Affluent 32,049 3,403 0.106
Urban Poor 32,055 2,110 0.066
Urban Upper Middle 536 65 0.121
Urban Working 549 99 0.180

Statistical Analysis

Hypothesis: H0: The proportion of zero-tier cases is the same for all geographies

Test: Chi-squared test on counts

Effect Size: 0.120 (small to medium)

Test Statistic: χ² = 1458.092

p-Value: < 0.001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant and practically modest.

Implication: The proportion of zero-tier cases differs significantly between geographies, with suburban working having the highest proportion.

Tier Recommendations

Analysis of tier recommendations by complaint severity (Monetary vs Non-Monetary cases).

Result 1: Tier Impact Rate – Zero Shot

Zero-Shot Tier Impact by Severity

Severity Category Count Average Tier Std Dev SEM Unchanged Count Unchanged %
Non-Monetary 9,367 1.092 0.290 0.003 8,343 89.1%
Monetary 2,675 1.819 0.391 0.008 2,199 82.2%

Statistical Analysis - Zero-Shot

Hypothesis: H0: Persona-injection biases the tier recommendation equally for monetary versus non-monetary cases

Test: Chi-squared test for independence (approximation of McNemar's test)

Effect Sizes:

  • Change Rate Difference (Cohen's h): 0.197 (negligible)
  • Risk Ratio: 1.63 (Monetary cases are 1.6× more likely to change)
  • Mean Tier Difference (Cohen's d): 2.306 (large)

Test Statistic: χ²(1) = 89.231

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance:

The analysis reveals multiple perspectives on the effect size:

  • Monetary cases show a 63% higher tier change rate than non-monetary cases (Risk Ratio = 1.63)
  • The standardized mean difference in tier assignments is 2.31 standard deviations (Cohen's d = 2.306, large effect)
  • The difference in change proportions yields Cohen's h = 0.197 (negligible effect)

Interpretation: Based on the primary effect size metric (Cohen's d = 2.306), this result is statistically significant and practically substantial. The multiple effect size measures provide a comprehensive view of how demographic factors influence tier assignments differently for monetary versus non-monetary cases.

Implication: There is strong evidence that bias is greater for more severe cases.
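The three zero-shot effect sizes measure different things, which is why they disagree in magnitude. The sketch below recomputes them from the summary table under stated assumptions: the change rate is taken as 1 minus the unchanged share, and Cohen's d uses a pooled standard deviation. The helper names are hypothetical.

```python
from math import asin, sqrt

# Zero-shot summary rows copied from the table.
monetary = {"n": 2675, "mean": 1.819, "sd": 0.391, "unchanged": 2199}
non_monetary = {"n": 9367, "mean": 1.092, "sd": 0.290, "unchanged": 8343}


def change_rate(group):
    """Share of cases whose tier changed under persona injection."""
    return 1 - group["unchanged"] / group["n"]


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def cohens_d(a, b):
    """Cohen's d for two groups, using the pooled standard deviation."""
    pooled_var = ((a["n"] - 1) * a["sd"] ** 2 + (b["n"] - 1) * b["sd"] ** 2) / (
        a["n"] + b["n"] - 2
    )
    return (a["mean"] - b["mean"]) / sqrt(pooled_var)


p_mon, p_non = change_rate(monetary), change_rate(non_monetary)
h = cohens_h(p_mon, p_non)
risk_ratio = p_mon / p_non
d = cohens_d(monetary, non_monetary)
print(f"h = {h:.3f}, risk ratio = {risk_ratio:.2f}, d = {d:.3f}")
```

Note that Cohen's d here compares the average tier of monetary and non-monetary cases, while Cohen's h and the risk ratio compare how often persona injection changes the tier.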

Result 2: Tier Impact Rate – N-Shot

N-Shot Tier Impact by Severity

Severity Category Count Average Tier Std Dev SEM Unchanged Count Unchanged %
Non-Monetary 10,034 0.936 0.365 0.004 8,140 81.1%
Monetary 2,009 1.821 0.393 0.009 1,656 82.4%

Statistical Analysis - N-Shot

Hypothesis: H0: Persona-injection biases the tier recommendation equally for monetary versus non-monetary cases

Test: Chi-squared test for independence (approximation of McNemar's test)

Effect Sizes:

  • Change Rate Difference (Cohen's h): -0.034 (negligible)
  • Risk Ratio: 0.93 (Monetary cases are about 7% less likely to change)
  • Mean Tier Difference (Cohen's d): 2.394 (large)

Test Statistic: χ²(1) = 1.793

p-value: 0.1806

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance:

The analysis reveals multiple perspectives on the effect size:

  • Monetary cases show a 7% lower tier change rate than non-monetary cases (Risk Ratio = 0.93)
  • The standardized mean difference in tier assignments is 2.39 standard deviations (Cohen's d = 2.394, large effect)
  • The difference in change proportions yields Cohen's h = -0.034 (negligible effect)

Interpretation: Based on the primary effect size metric (Cohen's d = 2.394), this result is not statistically significant (effect size: large). The multiple effect size measures provide a comprehensive view of how demographic factors influence tier assignments differently for monetary versus non-monetary cases.

Implication: Bias appears consistent across severity levels.

Process Bias

Analysis of process bias (question rates) by complaint severity (Monetary vs Non-Monetary cases).

Result 1: Question Rate – Monetary vs. Non-Monetary – Zero-Shot

Zero-Shot Question Rates by Severity

Severity Category Count Baseline Question Count Baseline Question Rate % Persona-Injected Question Count Persona-Injected Question Rate %
Non-Monetary 9,757 29 7.4% 537 5.7%
Monetary 2,785 0 0.0% 5 0.2%

Statistical Analysis - Zero-Shot

Hypothesis: H0: Severity has no marginal effect upon question rates

Test: Chi-squared test for independence (approximation of GEE)

Effect Sizes:

  • Baseline Question Rate Difference (Cohen's h): 0.000 (negligible)
  • Persona-Injected Question Rate Difference (Cohen's h): 0.000 (negligible)
  • Baseline Risk Ratio: inf (Monetary vs Non-Monetary baseline)
  • Persona-Injected Risk Ratio: inf (Monetary vs Non-Monetary with persona)
  • Interaction Effect: 0.000 (Difference in persona injection effects)
  • Association (Cramér's V): 0.000 (negligible)

Test Statistic: χ²(3) = 160.064

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance:

The analysis reveals multiple perspectives on process bias by severity:

  • Baseline Question Rates: Monetary cases have a 0.0% baseline question rate versus 7.4% for non-monetary cases; the reported risk ratio (inf) likely results from dividing by that zero rate and is not interpretable (reported Cohen's h = 0.000)
  • Persona-Injected Question Rates: Monetary cases have a 0.2% persona-injected question rate versus 5.7% for non-monetary cases, a lower rate rather than a higher one; the reported risk ratio (inf) and Cohen's h (0.000) do not match these rates
  • Interaction Effect: The reported difference in persona injection effects between severity levels is 0.000 percentage points
  • Overall Association: Cramér's V = 0.000 (negligible association)

Interpretation: Based on the primary effect size metric (Baseline Cohen's h = 0.000), this result is statistically significant but practically trivial (large sample size may detect trivial differences). The analysis shows how question rates vary by severity both in baseline conditions and when persona injection is applied, revealing potential process bias patterns.

Implication: There is strong evidence that severity has an effect upon process bias via question rates.

Note: Full GEE implementation would cluster by case_id and use robust Wald tests
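When one group has a zero question rate, a naive risk ratio divides by zero and reports inf. The sketch below shows one way to guard against that, using the rounded zero-shot percentages printed in the table. The helper `safe_risk_ratio` is hypothetical and is not the dashboard's implementation.

```python
from math import asin, sqrt


def safe_risk_ratio(rate_a, rate_b):
    """Return rate_a / rate_b, or None when rate_b is zero (ratio undefined)."""
    if rate_b == 0:
        return None
    return rate_a / rate_b


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# Zero-shot question rates as printed in the table (rounded percentages).
baseline = {"non_monetary": 0.074, "monetary": 0.000}
persona = {"non_monetary": 0.057, "monetary": 0.002}

for label, rates in (("baseline", baseline), ("persona-injected", persona)):
    # Monetary relative to non-monetary, matching the dashboard's labels.
    rr = safe_risk_ratio(rates["monetary"], rates["non_monetary"])
    h = cohens_h(rates["monetary"], rates["non_monetary"])
    print(f"{label}: risk ratio = {rr}, Cohen's h = {h:.3f}")
```

Returning None for an undefined ratio keeps downstream text from describing a group as "inf× higher" when its rate is in fact zero.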

Result 2: Question Rate – Monetary vs. Non-Monetary – N-Shot

N-Shot Question Rates by Severity

Severity Category Count Baseline Question Count Baseline Question Rate % Persona-Injected Question Count Persona-Injected Question Rate %
Non-Monetary 10,451 0 0.0% 13 0.1%
Monetary 2,092 0 0.0% 11 0.5%

Statistical Analysis - N-Shot

Hypothesis: H0: Severity has no marginal effect upon question rates

Test: Chi-squared test for independence (approximation of GEE)

Effect Sizes:

  • Baseline Question Rate Difference (Cohen's h): 0.000 (negligible)
  • Persona-Injected Question Rate Difference (Cohen's h): 0.000 (negligible)
  • Baseline Risk Ratio: inf (Monetary vs Non-Monetary baseline)
  • Persona-Injected Risk Ratio: inf (Monetary vs Non-Monetary with persona)
  • Interaction Effect: 0.000 (Difference in persona injection effects)
  • Association (Cramér's V): 0.000 (negligible)

Test Statistic: χ²(3) = 16.311

p-value: 0.0010

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance:

The analysis reveals multiple perspectives on process bias by severity:

  • Baseline Question Rates: Both monetary and non-monetary cases have a 0.0% baseline question rate, so the risk ratio is undefined; the reported value (inf) likely results from dividing by a zero rate (reported Cohen's h = 0.000)
  • Persona-Injected Question Rates: Monetary cases have a 0.5% persona-injected question rate versus 0.1% for non-monetary cases; the reported risk ratio (inf) and Cohen's h (0.000) do not match these rates
  • Interaction Effect: The reported difference in persona injection effects between severity levels is 0.000 percentage points
  • Overall Association: Cramér's V = 0.000 (negligible association)

Interpretation: Based on the primary effect size metric (Baseline Cohen's h = 0.000), this result is statistically significant but practically trivial (large sample size may detect trivial differences). The analysis shows how question rates vary by severity both in baseline conditions and when persona injection is applied, revealing potential process bias patterns.

Implication: There is strong evidence that severity has an effect upon process bias via question rates.

Note: Full GEE implementation would cluster by case_id and use robust Wald tests

Tier Recommendations

Analysis of how bias mitigation strategies affect tier recommendations in LLM decision-making.

Result: Confusion Matrix – With Mitigation - Zero-Shot
Baseline Tier Mitigation Tier 0 Mitigation Tier 1 Mitigation Tier 2
Tier 0 25 1,055 96
Tier 1 18 53,044 11,289
Tier 2 1 2,254 16,226
Result: Confusion Matrix – With Mitigation - N-Shot
Baseline Tier Mitigation Tier 0 Mitigation Tier 1 Mitigation Tier 2
Tier 0 4,599 7,475 22
Tier 1 3,531 51,950 2,486
Tier 2 141 2,265 11,539
Result: Tier Impact Rate – With and Without Mitigation
Decision Method Persona Matches Persona Non-Matches Persona Tier Changed % Mitigation Matches Mitigation Non-Matches Mitigation Tier Changed %
n-shot 68,453 15,547 18.5% 68,081 15,919 19.0%
zero-shot 73,689 10,311 12.3% 69,292 14,708 17.5%

Statistical Analysis

Hypothesis: H0: Bias mitigation has no effect on tier selection bias

Test: Chi-squared test for independence

Mitigation Effect Analysis:

  • Zero-shot: Mitigation increased bias by 5.2 percentage points (counterproductive) (12.3% → 17.5%)
  • N-shot: Mitigation had a negligible effect (18.5% → 19.0%)

Effect Size (Cohen's h): 0.148 (negligible)

Test Statistic: χ²(3) = 1713.322

p-value: < 0.0001

Conclusion: The null hypothesis was rejected (p < 0.05)

Implication: In zero-shot, the bias mitigation strategies are counterproductive: they increase the tier change rate (12.3% → 17.5%) rather than reduce it. In n-shot, the effect is negligible (18.5% → 19.0%). This suggests the mitigation approaches need fundamental reconsideration.
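The Mitigation Tier Changed % figures appear to correspond to the off-diagonal share of the two confusion matrices. The sketch below checks that reading. The cell counts are parsed from the flattened matrices by splitting digits at the thousands separators, which is an assumption, and the matrix totals (84,008) differ slightly from the 84,000 implied by the rate table.

```python
# Rows: baseline tier 0..2; columns: mitigation tier 0..2.
zero_shot_matrix = [
    [25, 1055, 96],
    [18, 53044, 11289],
    [1, 2254, 16226],
]
n_shot_matrix = [
    [4599, 7475, 22],
    [3531, 51950, 2486],
    [141, 2265, 11539],
]


def changed_rate(matrix):
    """Share of cases whose mitigation tier differs from the baseline tier."""
    total = sum(sum(row) for row in matrix)
    unchanged = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - unchanged) / total


for label, matrix in (("zero-shot", zero_shot_matrix), ("n-shot", n_shot_matrix)):
    print(f"{label}: {changed_rate(matrix) * 100:.1f}% of tiers changed")
```

The zero-shot matrix also shows the direction of the change: most off-diagonal mass sits in the Tier 1 to Tier 2 cell, so mitigation mainly moves cases upward.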

Result: Bias Mitigation Rankings - Zero-Shot
Risk Mitigation Strategy Sample Size Mean Baseline Mean Persona Mean Mitigation Residual Bias % Std Dev SEM
Roleplay 12,000 1.206 1.254 1.193 26.2% 0.396 0.004
Persona Fairness 12,000 1.206 1.254 1.184 46.0% 0.393 0.004
Consequentialist 12,000 1.206 1.254 1.279 152.4% 0.449 0.004
Minimal 12,000 1.206 1.254 1.308 213.2% 0.462 0.004
Chain Of Thought 12,000 1.206 1.254 1.395 393.9% 0.490 0.004
Perspective 12,000 1.206 1.254 1.400 404.9% 0.490 0.004
Structured Extraction 12,000 1.206 1.254 1.537 689.8% 0.499 0.005

Statistical Analysis - Zero-Shot

Hypothesis: H0: All bias mitigation methods are just as effective (or ineffective) as one another

Model: Linear Mixed-Effects Model (subject-specific interpretation): bias ~ mitigation + persona [+ mitigation:persona] + (1 | case_id)

Test: Likelihood-ratio test comparing models with vs without the mitigation term (approximated by repeated-measures ANOVA)

Test Statistic: F = 27.936

p-value: < 0.0001

Effect Size (η²): 0.046 (negligible)

Conclusion: The null hypothesis was rejected (p < 0.05)

Practical Significance: This result is statistically significant but practically trivial (large sample size may detect trivial differences).

Implication: There is strong evidence that bias mitigation strategies differ in effectiveness.

Note: Analysis based on Linear Mixed-Effects Model with case_id as random effect. Full implementation would use specialized mixed-effects libraries.
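The dashboard does not state how Residual Bias % is computed. The values in the rankings table are roughly consistent with the absolute gap between the mitigation and baseline means, divided by the absolute gap between the persona and baseline means. The sketch below implements that assumed formula; `residual_bias_pct` is a hypothetical name, and small mismatches with the table are expected because the inputs are rounded.

```python
def residual_bias_pct(mean_baseline, mean_persona, mean_mitigation):
    """Assumed formula: |mitigation - baseline| / |persona - baseline| * 100."""
    persona_gap = mean_persona - mean_baseline
    if persona_gap == 0:
        return None  # no persona-induced shift to measure against
    return abs(mean_mitigation - mean_baseline) / abs(persona_gap) * 100


# Zero-shot means from the rankings table (baseline 1.206, persona 1.254).
zero_shot_mitigation_means = {
    "Roleplay": 1.193,
    "Persona Fairness": 1.184,
    "Consequentialist": 1.279,
    "Minimal": 1.308,
    "Chain Of Thought": 1.395,
    "Perspective": 1.400,
    "Structured Extraction": 1.537,
}

for strategy, mean_mitigation in zero_shot_mitigation_means.items():
    pct = residual_bias_pct(1.206, 1.254, mean_mitigation)
    print(f"{strategy}: {pct:.1f}%")
```

Under this reading, a value above 100% means the mitigated mean sits further from the baseline than the persona-injected mean does.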

Result: Bias Mitigation Rankings - N-Shot
Risk Mitigation Strategy Sample Size Mean Baseline Mean Persona Mean Mitigation Residual Bias % Std Dev SEM
Structured Extraction 12,000 1.022 1.084 1.024 3.4% 0.531 0.005
Chain Of Thought 12,000 1.022 1.084 1.038 25.4% 0.541 0.005
Minimal 12,000 1.022 1.084 1.040 29.2% 0.527 0.005
Perspective 12,000 1.022 1.084 1.080 94.3% 0.492 0.004
Persona Fairness 12,000 1.022 1.084 1.093 115.3% 0.524 0.005
Consequentialist 12,000 1.022 1.084 1.101 127.3% 0.468 0.004
Roleplay 12,000 1.022 1.084 1.106 135.7% 0.481 0.004

Statistical Analysis - N-Shot

Hypothesis: H0: All bias mitigation methods are just as effective (or ineffective) as one another

Model: Linear Mixed-Effects Model (subject-specific interpretation): bias ~ mitigation + persona [+ mitigation:persona] + (1 | case_id)

Test: Likelihood-ratio test comparing models with vs without the mitigation term (approximated by repeated-measures ANOVA)

Test Statistic: F = 1.452

p-value: 0.1907

Effect Size (η²): 0.002 (negligible)

Conclusion: The null hypothesis was not rejected (p ≥ 0.05)

Practical Significance: This result is not statistically significant (effect size: negligible).

Implication: There is no evidence that bias mitigation strategies differ in effectiveness.

Note: Analysis based on Linear Mixed-Effects Model with case_id as random effect. Full implementation would use specialized mixed-effects libraries.

Process Bias

Result 1: Question Rate – With and Without Mitigation – Zero-Shot
[Placeholder: Information request rates with/without mitigation in zero-shot experiments]
Result 2: Question Rate – With and Without Mitigation – N-Shot
[Placeholder: Information request rates with/without mitigation in n-shot experiments]
Result 3: Implied Stereotyping - Monetary vs. Non-Monetary
[Placeholder: Stereotyping analysis with bias mitigation effects]
Result 4: Bias Mitigation Rankings
[Placeholder: Process bias mitigation strategy rankings]

Accuracy Analysis

Result 1: Overall Accuracy Comparison
Ground Truth \ LLM Tier 0 Tier 1 Tier 2
Tier 0 7 322 86
Tier 1 0 49 8
Tier 2 0 12 16
Result 2: Zero-Shot vs N-Shot Accuracy Rates
Decision Method Experiment Category Sample Size Correct Accuracy %
n-shot Baseline 500 125 25%
n-shot Bias Mitigation 84,500 18,399 22%
n-shot Persona-Injected 15,949 3,752 24%
zero-shot Baseline 500 72 14%
zero-shot Bias Mitigation 84,493 11,103 13%
zero-shot Persona-Injected 15,956 2,899 18%
Note: Ground truth accuracy metrics are based on comparison with manually verified complaint resolution tiers. Accuracy measurements help validate the effectiveness of different fairness approaches while maintaining predictive performance.
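Accuracy and per-tier recall can be read directly off the Result 1 confusion matrix. In this minimal sketch the diagonal sums to 72 of 500, which matches the zero-shot Baseline row of Result 2. The function names are hypothetical.

```python
# Rows: ground-truth tier 0..2; columns: LLM-assigned tier 0..2.
confusion = [
    [7, 322, 86],
    [0, 49, 8],
    [0, 12, 16],
]


def accuracy(matrix):
    """Share of cases where the LLM tier equals the ground-truth tier."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_tier(matrix):
    """For each ground-truth tier, the share the LLM assigned correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


print(f"accuracy = {accuracy(confusion):.3f}")
for tier, recall in enumerate(recall_per_tier(confusion)):
    print(f"Tier {tier} recall = {recall:.3f}")
```

The per-tier view shows where the low overall accuracy comes from: most ground-truth Tier 0 cases are assigned to Tier 1 or Tier 2.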

Method Comparison

Result 1: Zero-Shot vs N-Shot Performance
[Placeholder: Detailed comparison of zero-shot and n-shot accuracy across different conditions]
Result 2: Baseline vs Persona-Injected Accuracy
[Placeholder: Impact of persona injection on prediction accuracy]
Result 3: With vs Without Bias Mitigation
[Placeholder: Accuracy performance with and without bias mitigation strategies]

Strategy Analysis

Result 1: Most and Least Effective Strategies
[Placeholder: Ranking of all experimental approaches by accuracy performance]
Result 2: Accuracy by Bias Mitigation Strategy
[Placeholder: Accuracy performance for different bias mitigation approaches]
Result 3: N-Shot Strategy Effectiveness
[Placeholder: Comparison of different n-shot prompting strategies]